The k-means method is a widely used clustering algorithm. One of its distinguished
features is its speed in practice. Its worst-case running-time, however, is exponential,
leaving a gap between practical and theoretical performance. Arthur and Vassilvitskii [3]
aimed at closing this gap, and they proved a bound of $\mathrm{poly}(n^k, \sigma^{-1})$
on the smoothed running-time of the k-means method, where $n$ is the number of
data points and $\sigma$ is the standard deviation of the Gaussian perturbation. This
bound, though better than the worst-case bound, is still much larger than the
running-time observed in practice.

We improve the smoothed analysis of the k-means method by showing two upper
bounds on the expected running-time of k-means. First, we prove that the expected
running-time is bounded by a polynomial in $n^{\sqrt{k}}$ and $\sigma^{-1}$. Second, we prove an
upper bound of $k^{kd} \cdot \mathrm{poly}(n, \sigma^{-1})$, where $d$ is the dimension of the data space. The
polynomial is independent of $k$ and $d$, and we obtain a polynomial bound for the
expected running-time for $k, d \in O(\sqrt{\log n / \log\log n})$.

Finally, we show that k-means runs in smoothed polynomial time for one-dimensional
instances.

1 Introduction 

The k-means method is a very popular algorithm for clustering high-dimensional data. It
is based on ideas by Lloyd [10]. It is a local search algorithm: Initiated with $k$ arbitrary
cluster centers, it assigns every data point to its nearest center, then readjusts the centers,
reassigns the data points, and so on until it stabilizes. (In Section 1.1, we describe the algorithm
formally.) The k-means method terminates in a local optimum, which might be far worse than
the global optimum. However, in practice it works very well. It is particularly popular because
of its simplicity and its speed: "In practice, the number of iterations is much less than the
number of samples", as Duda et al. [6, Section 10.4.3] put it. According to Berkhin [5], the
k-means method "is by far the most popular clustering tool used in scientific and industrial
applications."

The practical performance and popularity of the k-means method stand in stark contrast to
its performance in theory. The only upper bounds for its running-time are based on the
observation that no clustering appears twice in a run of k-means: Obviously, $n$ points can be
distributed among $k$ clusters in only $k^n$ ways. Furthermore, the number of Voronoi partitions
of $n$ points in $\mathbb{R}^d$ into $k$ classes is bounded by a polynomial in $n^{kd}$ [8], which yields an upper
bound of $\mathrm{poly}(n^{kd})$. On the other hand, Arthur and Vassilvitskii [2] showed that k-means can
run for $2^{\Omega(\sqrt{n})}$ iterations in the worst case.

To close the gap between good practical and poor theoretical performance of algorithms, 
Spielman and Teng introduced the notion of smoothed analysis [12]: An adversary specifies 
an instance, and this instance is then subject to slight random perturbations. The smoothed 
running-time is the maximum over the adversarial choices of the expected running-time. On 
the one hand, this rules out pathological, isolated worst-case instances. On the other hand, 
smoothed analysis, unlike average-case analysis, is not dominated by random instances since 
the instances are not completely random; random instances are usually not typical instances 
and have special properties with high probability. Thus, smoothed analysis also circumvents 
the drawbacks of average-case analysis. For a survey of smoothed analysis, we refer to Spielman 
and Teng [13]. 
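
To make the definition concrete, the following minimal sketch estimates, by sampling, the expected cost of one fixed adversarial instance under Gaussian perturbation. The names `perturb` and `estimate_expected_cost`, the `cost` callback (for example, an iteration counter of some algorithm), and the number of trials are assumptions made for illustration. The smoothed running-time would be the maximum of this expectation over all adversarial instances, which the sketch does not attempt to compute.

```python
import random


def perturb(points, sigma, rng):
    """Add independent Gaussian noise (standard deviation sigma) to every coordinate."""
    return [tuple(c + rng.gauss(0.0, sigma) for c in p) for p in points]


def estimate_expected_cost(points, sigma, cost, trials=100, seed=0):
    """Monte Carlo estimate of E[cost(perturbed instance)] for one fixed instance.

    `cost` is a hypothetical callback that maps a point list to a number,
    e.g. the number of iterations an algorithm needs on that input.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        total += cost(perturb(points, sigma, rng))
    return total / trials
```

A typical use would pass the adversarial points, the perturbation parameter, and a function that runs the algorithm of interest and returns its iteration count.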

The goal of this paper is to bound the smoothed running-time of the k-means method. There
are basically two reasons why the smoothed running-time of the k-means method is a more
realistic measure than its worst-case running-time: First, data obtained from measurements
is inherently noisy. So even if the original data were a bad instance for k-means, the data
measured is most likely a slight perturbation of it. Second, if the data possesses a meaningful
k-clustering, then slightly perturbing the data should preserve this clustering. Thus, smoothed
analysis might help to obtain a faster k-means method: We take the data measured, perturb
it slightly, and then run k-means on the perturbed instance. The bounds for the smoothed
running-time carry over to this variant of the k-means method.

1.1 k-Means Method

An instance for k-means clustering is a point set $X \subseteq \mathbb{R}^d$ consisting of $n$ points. The aim is to
find a clustering $C_1, \ldots, C_k$ of $X$, i.e., a partition of $X$, as well as cluster centers
$c_1, \ldots, c_k$ such that the potential

$$\sum_{i=1}^{k} \sum_{x \in C_i} \|x - c_i\|^2$$

is minimized. Given the cluster centers, every data point should obviously be assigned to the
cluster whose center is closest to it. The name k-means stems from the fact that, given the
clusters, the centers $c_1, \ldots, c_k$ should be chosen as the centers of mass, i.e.,
$c_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$ of $C_i$. The k-means method proceeds now as follows:

1. Select cluster centers $c_1, \ldots, c_k$.

2. Assign every $x \in X$ to the cluster $C_i$ whose cluster center $c_i$ is closest to it.

3. Set $c_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$.

4. If clusters or centers have changed, go to 2. Otherwise, terminate.

Since the potential decreases in every step, no clustering occurs twice, and the algorithm 
eventually terminates. 
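
The four steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not an optimized implementation: the function names are chosen here, ties are broken in favor of the lowest cluster index, and a cluster that loses all its points keeps its old center (the description above leaves that case open).

```python
def sq_dist(x, c):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def potential(points, centers, assign):
    """Sum over all points of the squared distance to the assigned center."""
    return sum(sq_dist(x, centers[a]) for x, a in zip(points, assign))


def k_means(points, centers):
    """Run the k-means method from the given initial centers (step 1).

    Returns (centers, assignment, number of iterations that changed something).
    """
    centers = [tuple(c) for c in centers]
    assign = None
    iterations = 0
    while True:
        # Step 2: assign every point to its closest center.
        new_assign = [
            min(range(len(centers)), key=lambda i: sq_dist(x, centers[i]))
            for x in points
        ]
        # Step 3: move every center to the center of mass of its cluster.
        new_centers = []
        for i, c in enumerate(centers):
            members = [x for x, a in zip(points, new_assign) if a == i]
            if members:
                new_centers.append(
                    tuple(sum(x[j] for x in members) / len(members)
                          for j in range(len(c)))
                )
            else:
                new_centers.append(c)  # assumption: empty clusters keep their center
        # Step 4: stop once neither clusters nor centers have changed.
        if new_assign == assign and new_centers == centers:
            return centers, assign, iterations
        assign, centers = new_assign, new_centers
        iterations += 1
```

For example, on the one-dimensional points 0, 1, 10, 11 with initial centers 0 and 10, the sketch moves the centers to 0.5 and 10.5 in one iteration and then stops.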

1.2 Related Work 

The problem of finding a good clustering can be approximated arbitrarily well: Badoiu et al. [4],
Matousek [11], and Kumar et al. [9] devised polynomial time approximation schemes with
different dependencies on the approximation ratio $(1+\varepsilon)$ as well as $n$, $k$, and $d$:
$O(2^{O(k\varepsilon^{-2}\log k)} \cdot nd)$, $O(n\varepsilon^{-2k^2 d}\log^k n)$, and $O(\exp(k/\varepsilon) \cdot nd)$, respectively.

While the polynomial time approximation schemes show that k-means clustering can be
approximated arbitrarily well, the method of choice for finding a k-clustering is the k-means
method due to its performance in practice. However, the only polynomial bound for k-means
holds for $d = 1$, and only for instances with polynomial spread [7], which is the maximum
distance of points divided by the minimum distance.
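
As a small illustration of the spread, the sketch below computes it by brute force over all pairs of points. The function name `spread` is an assumption made here, and the points are assumed to be pairwise distinct so that the minimum distance is positive.

```python
import math
from itertools import combinations


def spread(points):
    """Maximum pairwise distance divided by minimum pairwise distance.

    Assumes at least two points, all pairwise distinct.
    """
    dists = [math.dist(p, q) for p, q in combinations(points, 2)]
    return max(dists) / min(dists)
```

For the one-dimensional points 0, 1, 4 the pairwise distances are 1, 4, and 3, so the spread is 4.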

Arthur and Vassilvitskii [3] have analyzed the running-time of the k-means method subject
to Gaussian perturbation: The points are drawn according to independent $d$-dimensional
Gaussian distributions with standard deviation $\sigma$. Arthur and Vassilvitskii proved that the
expected running-time after perturbing the input with Gaussians with standard deviation $\sigma$ is
polynomial in $n^k$, $d$, the diameter of the perturbed point set, and $1/\sigma$.

Recently, Arthur [1] showed that the probability that the running-time of k-means subject
to Gaussian perturbations exceeds a polynomial in $n$, $d$, the diameter of the instance, and $1/\sigma$
is bounded by $O(1/n)$. However, his argument does not yield any significant bound on the
expected running-time of k-means: The probability of $O(1/n)$ that the running-time exceeds a
polynomial bound is too large to yield an upper bound for the expected running-time, except
for the trivial upper bound of $\mathrm{poly}(n^{kd})$.

1.3 New Results 

We improve the smoothed analysis of the k-means method by proving two upper bounds on
its running-time. First, we show that the smoothed running-time of k-means is bounded by a
polynomial in $n^{\sqrt{k}}$ and $1/\sigma$.

Theorem 1. Let $X \subseteq \mathbb{R}^d$ be a set of $n$ points drawn according to independent Gaussian
distributions whose means are in $[0,1]^d$. Then the expected running-time of the k-means method
on the instance $X$ is bounded from above by a polynomial in $n^{\sqrt{k}}$ and $1/\sigma$.

Thus, compared to the previously known bound, we decrease the exponent by a factor of $\sqrt{k}$.
Second, we show that the smoothed running-time of k-means is bounded by $k^{kd} \cdot \mathrm{poly}(n, 1/\sigma)$.
In particular, this decouples the exponential part of the bound from the number $n$ of points.

Theorem 2. Let $X$ be drawn as described in Theorem 1. Then the expected running-time of
the k-means method on the instance $X$ is bounded from above by $k^{kd} \cdot \mathrm{poly}(n, 1/\sigma)$.

An immediate consequence of Theorem 2 is the following corollary, which proves that the
expected running-time is polynomial in $n$ and $1/\sigma$ if $k$ and $d$ are small compared to $n$. This
result is of particular interest since $d$ and $k$ are usually much smaller than $n$.

Corollary 3. Let $k, d \in O(\sqrt{\log n / \log\log n})$. Let $X$ be drawn as described in Theorem 1.
Then the expected running-time of k-means on the instance $X$ is bounded by a polynomial in
$n$ and $1/\sigma$.

David Arthur [1] presented an insightful proof that k-means runs in time polynomial in
$n$, $1/\sigma$, and the diameter of the instance with a probability of at least $1 - O(1/n)$. It is
worth pointing out that his result is orthogonal to our results: neither do our results imply
polynomial running time with probability $1 - O(1/n)$, nor does Arthur's result yield any
non-trivial bound on the expected running-time (not even $\mathrm{poly}(n^k, 1/\sigma)$) since the success
probability of $1 - O(1/n)$ is way too small. The exception is our result for $d = 1$, which
yields not only a bound on the expectation, but also a bound that holds with high probability.
However, the original definition of smoothed analysis [12] is in terms of expectation, not in
terms of bounds that hold with a probability of $1 - o(1)$.

To prove our bounds, we prove a lemma about perturbed point sets (Lemma 5). The lemma
bounds the number of points close to the boundaries of Voronoi partitions that arise during the
execution of k-means. It might be of independent interest, in particular for smoothed analyses
of geometric algorithms and problems.

Finally, we prove a polynomial bound for the running-time of k-means in one dimension.

Theorem 4. Let $X \subseteq \mathbb{R}$ be drawn according to 1-dimensional Gaussian distributions as
described in Theorem 1. Then the expected running-time of k-means on $X$ is polynomial in $n$
and $1/\sigma$. Furthermore, the probability that the running-time exceeds a polynomial in $n$ and
$1/\sigma$ is bounded by $1/\mathrm{poly}(n)$.

We remark that this result for $d = 1$ is not implied by the result of Har-Peled and Sadri [7]
that the running-time of one-dimensional k-means is polynomial in $n$ and the spread of the
instance. The reason is that the expected value of the square of the spread is unbounded.

The restriction of the adversarial points to be in $[0,1]^d$ is necessary: Without any bound,
the adversary can place the points arbitrarily far away, thus diminishing the effect of the 
perturbation. We can get rid of this restriction and obtain the same results by allowing the 
bounds to be polynomial in the diameter of the adversarial instance. However, for the sake of 
clarity and to avoid another parameter, we have chosen the former model. 

1.4 Outline 

To prove our two main theorems, we first prove a property of perturbed point sets (Section 2):
In any step of the k-means algorithm, there are not too many points close to any of the at most
$\binom{k}{2}$ hyperplanes that bisect the centers and that form the Voronoi regions. To put it another
way: No matter how k-means partitions the point set $X$ into $k$ Voronoi regions, the number
of points close to any boundary is rather small with overwhelming probability.

We use this lemma in Section 3: First, we use it to prove Lemma 8, which bounds the
expected number of iterations in terms of the smallest possible distance of two clusters. Using
this bound, we derive a first upper bound for the expected number of iterations (Lemma 9),
which will result in Theorem 2 later on.

In Sections 4 and 5, we distinguish between iterations in which at most $\sqrt{k}$ or at least $\sqrt{k}$
clusters gain or lose points. This will result in Theorem 1.

We consider the special case of $d = 1$ in Section 6. For this case, we prove an upper bound
polynomial in $n$ and $1/\sigma$ until the potential has dropped by at least 1.

In Sections 3, 4, 5, and 6, we are only concerned with bounding the number of iterations
until the potential has dropped by at least 1. Using these bounds and an upper bound on the
potential after the first round, we will derive Theorems 1, 2, and 4 as well as Corollary 3 in
Section 7.

1.5 Preliminaries 

In the following, $X$ is the perturbed instance on which we run k-means, i.e.,
$X = \{x_1, \ldots, x_n\} \subseteq \mathbb{R}^d$ is a set of $n$ points, where each point $x_i$ is drawn according to a
$d$-dimensional Gaussian distribution with mean $\mu_i \in [0,1]^d$ and standard deviation $\sigma$.

Inaba et al. [8] proved that the number of iterations of k-means is $\mathrm{poly}(n^{kd})$ in the worst
case. We abbreviate this bound by $W \le n^{\kappa kd}$ for some constant $\kappa$ in the following.

Let $D \ge 1$ be chosen such that, with a probability of at least $1 - W^{-1}$, every data point from
$X$ lies in the hypercube $\mathcal{D} := [-D, 1+D]^d$ after the perturbation. In Section 7, we prove that
$D$ can be bounded by a polynomial in $n$ and $\sigma$, and we use this fact in the following sections.
We denote by $\mathcal{F}$ the failure event that there exists one point in $X$ that does not lie in the
hypercube $\mathcal{D}$ after the perturbation. We say that a cluster is active in an iteration if it gains
or loses at least one point.

We will always assume in the following that $d \le n$ and $k \le n$, and we will frequently bound
both $d$ and $k$ by $n$ to simplify calculations. Of course, $k \le n$ holds for every meaningful instance
since it does not make sense to partition $n$ points into more than $n$ clusters. Furthermore,
we can assume $d \le n$ for two reasons: First, the dimension is usually much smaller than the
number of points, and, second, if $d > n$, then we can project the points to a lower-dimensional
subspace without changing anything.

Let $\mathcal{C} = \{C_1, \ldots, C_k\}$ denote the set of clusters. For a natural number $k$, let $[k] = \{1, \ldots, k\}$.
In the following, we will assume that numbers such as $\sqrt{k}$ are integers. For the sake of clarity,
we do not write down the tedious floor and ceiling functions that are actually necessary. Since
we are only interested in the asymptotics, this does not affect the validity of the proofs.
Furthermore, we assume in the following sections that $\sigma \le 1$. This assumption is only made
to simplify the arguments and we describe in Section 7 how to get rid of it.

2 A Property of Perturbed Point Sets 

The following lemma shows that, with high probability, there are not too many points close to 
the hyperplanes dividing the clusters. It is crucial for our bounds for the smoothed running- 
time: If not too many points are close to the bisecting hyperplanes, then, eventually, one point 
that is further away from the bisecting hyperplanes must go from one cluster to another, which 
causes a significant decrease of the potential. 

Lemma 5. Let $a \in [k]$ be arbitrary. With a probability of at least $1 - 2W^{-1}$, the following
holds: In every step of the k-means algorithm (except for the first one) in which at least $kd/a$
points change their assignment, at least one of these points has a distance larger than

$$\varepsilon := \frac{\sigma^4}{32 n^2 d D^2} \cdot \left(\frac{\sigma}{3 D n^{3+2\kappa}}\right)^{4a}$$

from the bisector that it crosses.
Proof. We consider a step of the k-means algorithm, and we refer to the configuration before
this step as the first configuration and to the configuration after this step as the second
configuration. To be precise, we assume that in the first configuration the positions of the centers are
the centers of mass of the points assigned to them in this configuration. The step we consider
is the reassignment of the points according to the Voronoi diagram in the first configuration.

Let $B \subseteq X$ with $|B| = \ell := kd/a$ be a set of points that change their assignment during
the step. There are at most $n^{\ell}$ choices for the points in $B$ and at most $k^{2\ell} \le n^{2\ell}$ choices for
the clusters they are assigned to in the first and the second configuration. We apply a union
bound over all these at most $n^{3\ell}$ choices.

The following sets are defined for all $i, j \in [k]$ and $j \ne i$. Let $B_i \subseteq B$ be the set of points
that leave cluster $C_i$. Let $B_{i,j} \subseteq B_i$ be the set of points assigned to cluster $C_i$ in the first and
to cluster $C_j$ in the second configuration, i.e., the points in $B_{i,j}$ leave $C_i$ and enter $C_j$. We have

$$B = \bigcup_{i} B_i \quad\text{and}\quad B_i = \bigcup_{j \ne i} B_{i,j}.$$

Let $A_i$ be the set of points that are in $C_i$ in the first configuration except for those in $B_i$.
We assume that the positions of the points in $A_i$ are determined by an adversary. Since the
sets $A_1, \ldots, A_k$ form a partition of the points in $X \setminus B$ that has been obtained in the previous
step on the basis of a Voronoi diagram, there are at most $W$ choices for this partition [8]. We
also apply a union bound over the choices for this partition.

In the first configuration, exactly the points in $A_i \cup B_i$ are assigned to cluster $C_i$. Let
$c_1, \ldots, c_k$ denote the positions of the cluster centers in the first configuration, that is, $c_i$ is the
center of mass of $A_i \cup B_i$. Since the positions of the points in $X \setminus B$ are assumed to be fixed
by an adversary, and since we apply a union bound over the partition $A_1, \ldots, A_k$, the impact
of the set $A_i$ on position $c_i$ is fixed. However, we want to exploit the randomness of the points
in $B_i$ in the following. Thus, the positions of the centers are not fixed yet but they depend on
the randomness of the points in $B$. In particular, the bisecting hyperplane $H_{i,j}$ of the clusters
$C_i$ and $C_j$ is not fixed but depends on $B_i$ and $B_j$.

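In order to complete the proof, we have to estimate the probability of the event

$$\forall i, j \colon \forall b \in B_{i,j} \colon \operatorname{dist}(b, H_{i,j}) \le \varepsilon,$$

where $\operatorname{dist}(x, H) = \min_{y \in H} \|x - y\|$ denotes the shortest distance of a point $x$ to a hyperplane $H$.
In the following, we denote this event by $\mathcal{E}$. If the hyperplanes $H_{i,j}$ were fixed, the probability
of $\mathcal{E}$ could readily be seen to be at most $(\varepsilon/\sigma)^{\ell}$. But the hyperplanes are not fixed
since their positions and orientations depend on the points in the sets $B_{i,j}$. Therefore, we are
only able to prove the following weaker bound in Lemma 6:

$$\Pr[\mathcal{E} \wedge \neg\mathcal{F}] \le \left(\frac{3D}{\sigma}\right)^{kd} \cdot \left(\frac{32 n^2 d D^2 \varepsilon}{\sigma^4}\right)^{\ell/4},$$

where $\neg\mathcal{F}$ denotes the event that, after the perturbation, all points of $X$ lie in the hypercube
$\mathcal{D} = [-D, D+1]^d$. Now the union bound yields the following upper bound on the probability
that a set $B$ with the stated properties exists:

$$\begin{aligned}
\Pr[\mathcal{E}] &\le \Pr[\mathcal{E} \wedge \neg\mathcal{F}] + \Pr[\mathcal{F}] \\
&\le n^{3\ell} W \cdot \left(\frac{3D}{\sigma}\right)^{kd} \cdot \left(\frac{32 n^2 d D^2 \varepsilon}{\sigma^4}\right)^{\ell/4} + W^{-1} \\
&= n^{3\ell} W \cdot \left(\frac{3D}{\sigma}\right)^{kd} \cdot \left(\frac{\sigma}{3 D n^{3+2\kappa}}\right)^{kd} + W^{-1} \\
&\le n^{-\kappa kd} + W^{-1} \le 2W^{-1}.
\end{aligned}$$

The equation is by our choice of $\varepsilon$, the inequalities are due to some simplifications and
$W \le n^{\kappa kd}$. □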



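Lemma 6. The probability of the event $\mathcal{E} \wedge \neg\mathcal{F}$ is bounded from above by

$$\left(\frac{3D}{\sigma}\right)^{kd} \cdot \left(\frac{32 n^2 d D^2 \varepsilon}{\sigma^4}\right)^{\ell/4}.$$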

Proof. Let $g_i$ be the random vector that equals the sum of the points in $B_i$, i.e.,

$$g_i := \sum_{b \in B_i} b.$$

Due to the union bound, the influence of $A_i$ and $A_j$ on the hyperplane $H_{i,j}$ is fixed. Since the
union bound also fixes the number of points in $B_i$ and $B_j$, it suffices to know the sums $g_i$ and
$g_j$ to deduce the exact position of the hyperplane $H_{i,j}$. Hence, once all sums $g_i$ are fixed, all
hyperplanes are fixed as well. The drawback is, of course, that fixing the sums $g_i$ has an impact
on the distribution of the random positions of the points in $B_i$. We circumvent this problem
as follows: We basically show that if $B_i$ contains $m$ points and the sum $g_i$ is fixed, then we can
still use the randomness of $m - 1$ of these points. For sets $B_i$ that contain at least two points,
this means that we can use the randomness of at least half of its points. Complications are
only caused by sets $B_i$ that consist of a single point. For such sets, fixing $g_i$ is equivalent to
fixing the position of the point, and we give a more direct analysis without fixing $g_i$.

For $y_i, y_j \in \mathbb{R}^d$, we denote by $H_{i,j}(y_i, y_j)$ the bisector of the clusters $C_i$ and $C_j$ that is
obtained for $g_i = y_i$ and $g_j = y_j$. Let $k^*$ be the number of clusters $C_i$ with $|B_i| > 0$. Without
loss of generality, these are the clusters $C_1, \ldots, C_{k^*}$. This convention allows us to rewrite the
probability of $\mathcal{E} \wedge \neg\mathcal{F}$ as

$$\begin{aligned}
&\Pr\bigl[\forall i, j \colon \forall b \in B_{i,j} \colon \operatorname{dist}(b, H_{i,j}) \le \varepsilon \wedge \neg\mathcal{F}\bigr] \\
&\quad\le \int \cdots \int \Bigl(\prod_{i=1}^{k^*} f_{g_i}(y_i)\Bigr)
\cdot \Pr\bigl[\forall i, j \colon \forall b \in B_{i,j} \colon \operatorname{dist}(b, H_{i,j}(y_i, y_j)) \le \varepsilon \bigm| \forall i \colon g_i = y_i\bigr] \, dy_{k^*} \cdots dy_1,
\end{aligned}$$

where $f_{g_i}$ is the density of the random vector $g_i$. We admit that our notation is a bit sloppy:
If $|B_{i,j}| > 0$ and $j \notin \{1, \ldots, k^*\}$, then $H_{i,j}$ depends only on $y_i$. In this case, we should actually
write $H_{i,j}(y_i)$ instead of $H_{i,j}(y_i, y_j)$ in the formula above. In order to keep the notation less
cumbersome, we ignore this subtlety and assume that $H_{i,j}(y_i, y_j)$ is implicitly replaced by
$H_{i,j}(y_i)$ whenever necessary. Points from different sets $B_i$ and $B_j$ are independent even under
the assumption that the sums $g_i$ and $g_j$ are fixed. Hence, we can further rewrite the probability
as

$$\int \cdots \int \Bigl(\prod_{i=1}^{k^*} f_{g_i}(y_i)\Bigr)
\cdot \prod_{i=1}^{k^*} \Pr\bigl[\forall j \colon \forall b \in B_{i,j} \colon \operatorname{dist}(b, H_{i,j}(y_i, y_j)) \le \varepsilon \bigm| g_i = y_i\bigr] \, dy_{k^*} \cdots dy_1. \tag{1}$$

Now let us consider the probability

$$\Pr\bigl[\forall j \colon \forall b \in B_{i,j} \colon \operatorname{dist}(b, H_{i,j}(y_i, y_j)) \le \varepsilon \bigm| g_i = y_i\bigr] \tag{2}$$

for a fixed $i$ and for fixed values $y_i$ and $y_j$. To simplify the notation, let $B_i = \{b_1, \ldots, b_m\}$,
and let the corresponding hyperplanes (which are fixed because $y_i$ and the $y_j$'s are given) be
$H_1, \ldots, H_m$. (A hyperplane may occur several times in this list if more than one point goes
from $C_i$ to some cluster $C_j$.) Then the probability simplifies to

$$\Pr\bigl[\forall j \colon \operatorname{dist}(b_j, H_j) \le \varepsilon \bigm| g_i = y_i\bigr].$$

We distinguish between the cases $m = 1$ and $m > 1$.

Case 1: $m = 1$. The probability degenerates to

$$\Pr\bigl[\operatorname{dist}(b_1, H_1) \le \varepsilon \bigm| g_i = b_1 = y_i\bigr] =
\begin{cases} 1 & \text{if } y_i \text{ is } \varepsilon\text{-close to } H_1, \\ 0 & \text{otherwise.} \end{cases} \tag{3}$$

So, given $g_i = y_i$, there is no randomness in the event that $y_i$ is $\varepsilon$-close to $H_1$. Choose $j_i$ such
that $b_1 \in B_{i,j_i}$ and denote by $I_i(y_i, y_{j_i})$ the indicator variable defined in (3). We replace the
corresponding probability in (1) by $I_i(y_i, y_{j_i})$.

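Case 2: $m > 1$. Let $H_j(\varepsilon)$ be the slab of width $2\varepsilon$ around $H_j$, i.e.,
$H_j(\varepsilon) = \{x \in \mathbb{R}^d \mid \operatorname{dist}(x, H_j) \le \varepsilon\}$. Let $f$ be the joint density of the random vectors
$b_1, \ldots, b_{m-1}, g_i$. Then the probability (2) can be bounded from above by

$$\int_{z_1 \in H_1(\varepsilon)} \cdots \int_{z_{m-1} \in H_{m-1}(\varepsilon)} \frac{f(z_1, \ldots, z_{m-1}, y_i)}{f_{g_i}(y_i)} \, dz_{m-1} \cdots dz_1.$$

Now let $f_j$ be the density of the random vector $b_j$. This allows us to rewrite the joint density,
and we obtain the upper bound

$$\begin{aligned}
&\int_{z_1 \in H_1(\varepsilon)} \cdots \int_{z_{m-1} \in H_{m-1}(\varepsilon)}
\frac{f_1(z_1) \cdots f_{m-1}(z_{m-1}) \cdot f_m\bigl(y_i - \sum_{j=1}^{m-1} z_j\bigr)}{f_{g_i}(y_i)} \, dz_{m-1} \cdots dz_1 \\
&\quad\le \frac{1}{f_{g_i}(y_i) \cdot \sigma^d} \int_{z_1 \in H_1(\varepsilon)} \cdots \int_{z_{m-1} \in H_{m-1}(\varepsilon)}
f_1(z_1) \cdots f_{m-1}(z_{m-1}) \, dz_{m-1} \cdots dz_1 \\
&\quad= \frac{1}{f_{g_i}(y_i) \cdot \sigma^d} \prod_{j=1}^{m-1} \int_{z_j \in H_j(\varepsilon)} f_j(z_j) \, dz_j \\
&\quad\le \frac{1}{f_{g_i}(y_i) \cdot \sigma^d} \left(\frac{\varepsilon}{\sigma}\right)^{m-1}.
\end{aligned}$$

The first inequality follows from $f_m(\cdot) \le \sigma^{-d}$, and the last inequality follows because the
probability that a Gaussian random vector assumes a position within distance $\varepsilon$ of a given
hyperplane is at most $\varepsilon/\sigma$.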
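
The bound $\varepsilon/\sigma$ used in the last step can be checked by projecting onto the normal of the hyperplane. The following derivation is a sketch; the unit normal $v$, the offset $h$, and the mean $\mu$ are notation introduced here for illustration only.

```latex
% H = { x : v . x = h } with ||v|| = 1, and b ~ N(mu, sigma^2 I_d).
\begin{align*}
\operatorname{dist}(b, H) &= |v \cdot b - h|,
  \qquad v \cdot b \sim N(v \cdot \mu, \sigma^2), \\
\Pr[\operatorname{dist}(b, H) \le \varepsilon]
  &= \Pr\bigl[v \cdot b \in [h - \varepsilon, h + \varepsilon]\bigr]
  \le \frac{2\varepsilon}{\sqrt{2\pi}\,\sigma}
  \le \frac{\varepsilon}{\sigma}.
\end{align*}
% The first inequality bounds the one-dimensional Gaussian density by
% 1/(sqrt(2 pi) sigma) on an interval of length 2 epsilon; the second uses
% 2/sqrt(2 pi) < 1.
```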

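Now we plug the bounds derived in Cases 1 and 2 into (1). Let $k_=$ be the number of clusters
$C_i$ with $|B_i| = 1$, and let us assume that these are the clusters $C_1, \ldots, C_{k_=}$. Let

$$k_> = \bigl|\{i \mid |B_i| > 1\}\bigr| \quad\text{and}\quad m' = \sum_{i, |B_i| > 1} (|B_i| - 1),$$

that is, $k_>$ is the number of clusters that lose more than one point, and $m'$ is the number of
points leaving those clusters minus $k_>$, i.e., the number of points whose randomness we exploit.
Note that $k^* = k_> + k_=$. Then (1) can be bounded from above by

$$\frac{1}{\sigma^{k_> d}} \left(\frac{\varepsilon}{\sigma}\right)^{m'}
\int_{y_1 \in \mathcal{D}} \cdots \int_{y_{k^*} \in \mathcal{D}} \prod_{i=1}^{k_=} f_{g_i}(y_i) \cdot I_i(y_i, y_{j_i}) \, dy_{k^*} \cdots dy_1, \tag{4}$$

where $f_{g_i}$ cancels out for $i > k_=$. Observe that for fixed $y_{j_i}$ the term

$$\int_{y_i \in \mathcal{D}} f_{g_i}(y_i) \cdot I_i(y_i, y_{j_i}) \, dy_i \tag{5}$$

describes the probability that the point $b$ with $B_i = \{b\}$ lies in $\mathcal{D}$ and is at distance at most
$\varepsilon$ from the bisector of $C_i$ and $C_{j_i}$. For $y_{j_i} \in \mathcal{D}$, the point $b$ can only lie in the hypercube $\mathcal{D}$ if
it has a distance of at most $\sqrt{d}(1 + 2D) \le 3\sqrt{d}D$ from $y_{j_i}$. Hence, we can use Lemma 7 with
$\delta = 3\sqrt{d}D$ to obtain that the probability in (5) is upper bounded by

$$\frac{2\sqrt{n \cdot 3\sqrt{d}D\varepsilon}}{\sigma} \le \frac{4\sqrt{n\sqrt{d}D\varepsilon}}{\sigma}.$$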

Since there can be circular dependencies like, e.g., $j_i = i'$ and $j_{i'} = i$, it might not be possible
to reorder the integrals in (4) such that all terms become isolated as in (5). We resolve
these dependencies by only considering a subset of the clusters. To make this more precise,
consider a graph whose nodes are the clusters and that has a directed edge from $C_i$ to $C_j$ if
$|B_i| = |B_{i,j}| = 1$, i.e., for every $i$ with $|B_i| = 1$, there is an edge from $C_i$ to $C_{j_i}$. This graph
contains exactly $k_=$ edges and we can identify a subset of $k' \ge k_=/2$ edges that is cycle-free.
The subset $\mathcal{C}'$ of clusters that we consider consists of the tails of these edges. Since every node
in the graph has an out-degree of at most one, $\mathcal{C}'$ consists of exactly $k'$ clusters. For each cluster
$C_i$ not contained in $\mathcal{C}'$, we replace the corresponding $I_i(y_i, y_{j_i})$ by the trivial upper bound of
1. Without loss of generality, the identified subset $\mathcal{C}'$ consists of the clusters $C_1, \ldots, C_{k'}$ and it
is topologically sorted in the sense that $i < j_i$ for all $i \in \{1, \ldots, k'\}$. Given this, (4) can be
bounded from above by

$$\frac{1}{\sigma^{k_> d}} \left(\frac{\varepsilon}{\sigma}\right)^{m'}
\underbrace{\int_{\mathcal{D}} \cdots \int_{\mathcal{D}}}_{k^* - k' \text{ integrals}}
\Bigl(\prod_{i=k'+1}^{k_=} f_{g_i}(y_i)\Bigr)
\underbrace{\int_{\mathcal{D}} f_{g_{k'}}(y_{k'}) \cdot I_{k'}(y_{k'}, y_{j_{k'}}) \cdots \int_{\mathcal{D}} f_{g_1}(y_1) \cdot I_1(y_1, y_{j_1})}_{k' \text{ integrals}}
\, dy_1 \cdots dy_{k^*}.$$

Evaluating this formula from right to left according to Lemma 7 yields

$$\frac{(3D)^{dk} \cdot \varepsilon^{m'} \cdot \bigl(4\sqrt{n\sqrt{d}D\varepsilon}\bigr)^{k'}}{\sigma^{k_> d + m' + k'}},$$

where the term $(3D)^{dk}$ comes from the $k^* - k' \le k$ integrals over $y_{k'+1}, \ldots, y_{k^*}$. Each of these
integrals is over the hypercube $\mathcal{D}$, which has a volume of $(2D + 1)^d \le (3D)^d$. The definitions
directly yield that $k_> \le k$ and $m' + k' \le m' + k_= \le \ell$. Furthermore,

$$m' \ge \frac{\sum_{i, |B_i| > 1} |B_i|}{2} \quad\text{and}\quad k' \ge \frac{k_=}{2} = \frac{\sum_{i, |B_i| = 1} |B_i|}{2}$$

implies $m' + k'/2 \ge \ell/4$. Altogether, this yields the desired upper bound of

$$\left(\frac{3D}{\sigma}\right)^{kd} \cdot \left(\frac{32 n^2 d D^2 \varepsilon}{\sigma^4}\right)^{\ell/4}$$

for the probability of the event $\mathcal{E} \wedge \neg\mathcal{F}$. □



Lemma 7. Let $o \in \mathbb{R}^d$ and $p \in \mathbb{R}^d$ be arbitrary points and let $r$ denote a random point chosen
according to a $d$-dimensional normal distribution with arbitrary mean and standard deviation $\sigma$.
Moreover, let $\ell \in \{0, \ldots, n-1\}$, and let

$$q = \frac{\ell}{\ell+1} \cdot p + \frac{1}{\ell+1} \cdot r$$

be a convex combination of $p$ and $r$. Then the probability that $r$ is

(i) at a distance of at most $\delta > 0$ from $o$ and

(ii) at a distance of at most $\varepsilon > 0$ from the bisector of $o$ and $q$

is bounded from above by

$$\frac{2\sqrt{n\delta\varepsilon}}{\sigma}.$$

Proof. For ease of notation, we assume that $o$ is the origin of the coordinate system, i.e.,
$o = (0, \ldots, 0)$. Due to rotational symmetry, we can also assume that $p = (0, p_2, 0, \ldots, 0)$
for some $p_2 \in \mathbb{R}$. Let $r = (r_1, \ldots, r_d)$, and assume that the coordinates $r_2, \ldots, r_d$ are fixed
arbitrarily. We only exploit the randomness of the coordinate $r_1$, which is a one-dimensional
Gaussian random variable with standard deviation $\sigma$. The condition that $r$ has a distance of
at most $\varepsilon$ from the bisector of $o$ and $q$ can be expressed algebraically as

$$\frac{q - o}{\|q - o\|} \cdot \left(\frac{q + o}{2} - r\right) \in [-\varepsilon, \varepsilon].$$

Since $o = (0, \ldots, 0)$, this simplifies to

$$\frac{q}{\|q\|} \cdot \left(\frac{q}{2} - r\right) \in [-\varepsilon, \varepsilon]
\iff q \cdot \left(\frac{q}{2} - r\right) \in \bigl[-\|q\|\varepsilon, \|q\|\varepsilon\bigr].$$

Since $r_i$ is fixed for $i \ne 1$, also the coordinates $q_i$ of $q$ are fixed for $i \ne 1$. Setting $2\lambda = 1/(\ell+1)$
and exploiting that the first coordinate of $p$ is 0, we can further rewrite the previous expression
as

$$\begin{pmatrix} 2\lambda r_1 \\ q_2 \\ \vdots \\ q_d \end{pmatrix} \cdot
\begin{pmatrix} (\lambda - 1) r_1 \\ q_2/2 - r_2 \\ \vdots \\ q_d/2 - r_d \end{pmatrix}
\in \bigl[-\|q\|\varepsilon, \|q\|\varepsilon\bigr],$$

which is equivalent to

$$2\lambda(\lambda - 1) r_1^2 + \sum_{i=2}^{d} q_i (q_i/2 - r_i) \in \bigl[-\|q\|\varepsilon, \|q\|\varepsilon\bigr].$$

Since the coordinates $q_i$ and $r_i$ are fixed for $i \ne 1$, this implies that $r$ can only be at distance
at most $\varepsilon$ from the bisector of $o$ and $q$ if $r_1^2$ falls into a fixed interval of length

$$\frac{2\|q\|\varepsilon}{2\lambda(1 - \lambda)} \le 4(\ell + 1)\|q\|\varepsilon \le 4n\|q\|\varepsilon,$$

where the first inequality uses $\lambda = \frac{1}{2(\ell+1)}$ and $1 - \lambda \ge 1/2$.
As we only consider the event that $q$ has a distance of at most $\delta$ from $o$, we can replace $\|q\|$
by $\delta$ in the expression above, leaving us with the problem to find an upper bound for the
probability that the random variable $r_1^2$ assumes a value in a fixed interval of length at most
$4n\delta\varepsilon$. For this to happen, $r_1$ has to assume a value in one of two intervals, each of length at
most $2\sqrt{n\delta\varepsilon}$. Now $r_1$ follows a Gaussian distribution with standard deviation $\sigma$. This means
that the corresponding density is bounded from above by $(\sqrt{2\pi}\sigma)^{-1}$. Thus, the probability of
this event is at most

$$\frac{4\sqrt{n\delta\varepsilon}}{\sqrt{2\pi}\sigma} \le \frac{2\sqrt{n\delta\varepsilon}}{\sigma}. \qquad \Box$$
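
The step from an interval for the square of the first coordinate to two intervals for the coordinate itself can be spelled out as follows. This is a sketch; the interval endpoints $\alpha \le \beta$ and the length bound $L$ are notation introduced here for illustration.

```latex
% Suppose r_1^2 lies in [alpha, beta] with 0 <= alpha <= beta and beta - alpha <= L.
\begin{align*}
r_1^2 \in [\alpha, \beta]
  &\iff |r_1| \in \bigl[\sqrt{\alpha}, \sqrt{\beta}\bigr]
  \iff r_1 \in \bigl[-\sqrt{\beta}, -\sqrt{\alpha}\bigr] \cup \bigl[\sqrt{\alpha}, \sqrt{\beta}\bigr], \\
\bigl(\sqrt{\beta} - \sqrt{\alpha}\bigr)^2
  &\le \bigl(\sqrt{\beta} - \sqrt{\alpha}\bigr)\bigl(\sqrt{\beta} + \sqrt{\alpha}\bigr)
  = \beta - \alpha \le L .
\end{align*}
% Hence each of the two intervals has length at most sqrt(L); for
% L = 4 n delta epsilon this is 2 sqrt(n delta epsilon). If the interval for
% r_1^2 extends below 0, replacing alpha by 0 only shortens it.
```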



3 An Upper Bound 

Lemma 5 yields an upper bound on the number of iterations that k-means needs: Since there
are only few points close to hyperplanes, eventually a point switches from one cluster to another
that initially was not close to a hyperplane. The results of this section lead to the proof of
Theorem 2 in Section 7.3.

First, we bound the number of iterations in terms of the distance $\Delta$ of the closest cluster
centers that occur during the run of k-means.

Lemma 8. For every $a \in [k]$, with a probability of at least $1 - 3W^{-1}$, every sequence of $k^{kd/a} + 1$
consecutive steps of the k-means algorithm (not including the first one) reduces the potential
by at least

$$\frac{\varepsilon^2 \cdot \min\{\Delta^2, 1\}}{36 d D^2 k^{kd/a}},$$

where $\Delta$ denotes the smallest distance of two cluster centers that occurs during the sequence
and $\varepsilon$ is defined as in Lemma 5.

Proof. Consider the configuration directly before the sequence of steps is performed. Due to
Lemma 5, the probability that more than $kd/a$ points are within distance $\varepsilon$ of one of the
bisectors is at most $2W^{-1}$. Additionally, only with a probability of at most $W^{-1}$ there exists
a point from $X$ that does not lie in the hypercube $\mathcal{D}$. Let us assume in the following that none
of these failure events occurs, which is the case with a probability of at least $1 - 3W^{-1}$.

These points can assume at most $k^{kd/a}$ different configurations. Thus, during the considered
sequence, at least one point that is initially not within distance $\varepsilon$ of one of the bisectors must
change its assignment. Let us call this point $x$, and let us assume that it changes from cluster
$C_1$ to cluster $C_2$. Furthermore, let $\delta$ be the distance of the centers of $C_1$ and $C_2$ before the
sequence, and let $c_1$ and $c_2$ be the positions of the centers before the sequence. We distinguish
two cases. First, if $x$ is closer to $c_2$ than to $c_1$ already in the beginning of the sequence, then
$x$ will change its assignment in the first step. Let $v = \frac{c_2 - c_1}{\|c_2 - c_1\|}$ be the unit vector in $c_2 - c_1$
direction. We have $c_2 - c_1 = \delta v$ and $(2x - c_1 - c_2) \cdot v = \alpha$ for some $\alpha \ge 2\varepsilon$. Then the switch
of $x$ from $C_1$ to $C_2$ reduces the potential by at least

$$\|x - c_1\|^2 - \|x - c_2\|^2 = (2x - c_1 - c_2) \cdot (c_2 - c_1) \ge 2\varepsilon\delta \ge 2\varepsilon\Delta
\ge \frac{\varepsilon^2 \cdot \min\{\Delta^2, 1\}}{36 d D^2 k^{kd/a}},$$

where the last inequality follows from $\varepsilon \le 1$ and $D \ge 1$. This completes the first case. Second,
if $x$ is closer to $c_1$ than to $c_2$, then

$$\|x - c_2\|^2 - \|x - c_1\|^2 \ge 2\varepsilon\delta,$$

and hence,

$$\|x - c_2\| - \|x - c_1\| \ge \frac{2\varepsilon\delta}{\|x - c_2\| + \|x - c_1\|} \ge \frac{\varepsilon\delta}{3\sqrt{d}D}.$$

In this case, $x$ can only change to cluster $C_2$ after at least one of the centers of $C_1$ or $C_2$ has
moved. Consider the centers of $C_1$ and $C_2$ immediately before the reassignment of $x$ from $C_1$
to $C_2$. Let $c_1'$ and $c_2'$ denote these centers. Then

$$\|x - c_1'\| - \|x - c_2'\| \ge 0.$$

By combining these observations with the triangle inequality, we obtain

$$\begin{aligned}
\|c_1 - c_1'\| + \|c_2 - c_2'\| &\ge \bigl(\|x - c_1'\| - \|x - c_1\|\bigr) + \bigl(\|x - c_2\| - \|x - c_2'\|\bigr) \\
&= \bigl(\|x - c_1'\| - \|x - c_2'\|\bigr) + \bigl(\|x - c_2\| - \|x - c_1\|\bigr) \ge \frac{\varepsilon\delta}{3\sqrt{d}D}.
\end{aligned}$$

This implies that one of the centers must have moved by at least $\varepsilon\delta/(6\sqrt{d}D)$ during the
considered sequence of steps. Each time the center moves by some amount $\xi$, the potential
drops by at least $\xi^2$ (see Lemma 11). Since this function is convex, the smallest potential
drop is obtained if the center moves by $\varepsilon\delta/(6\sqrt{d}Dk^{kd/a})$ in each iteration. Then the decrease
of the potential due to the movement of the center is at least

$$k^{kd/a} \cdot \left(\frac{\varepsilon\delta}{6\sqrt{d}Dk^{kd/a}}\right)^2 \ge \frac{\varepsilon^2\Delta^2}{36 d D^2 k^{kd/a}},$$

which concludes the proof. □
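
The claim that an equal split of the movement minimizes the total potential drop can be checked with the Cauchy-Schwarz inequality. The derivation below is a sketch; the step count $m$, the per-step movements $\xi_1, \ldots, \xi_m$, and the total movement $S$ are notation introduced here for illustration.

```latex
% m steps; step i moves the center by xi_i >= 0; the total movement is at least S.
\begin{align*}
\sum_{i=1}^{m} \xi_i^2 \;\ge\; \frac{1}{m} \Bigl(\sum_{i=1}^{m} \xi_i\Bigr)^2 \;\ge\; \frac{S^2}{m},
\qquad \text{with equality if } \xi_1 = \dots = \xi_m = \frac{S}{m}.
\end{align*}
% With m = k^{kd/a} and S = \varepsilon\delta / (6\sqrt{d}D):
\[
\frac{S^2}{m} = \frac{\varepsilon^2 \delta^2}{36\, d\, D^2\, k^{kd/a}} .
\]
```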

In order to obtain a bound on the number of iterations that fc-means needs, we need to 
bound the distance A of the closest cluster centers. This is done in the following lemma, which 
exploits Lemma El The following lemma is the crucial ingredient of the proof of Theorem El 

Lemma 9. Let $a \in [k]$ be arbitrary. Then the expected number of steps until the potential
drops by at least 1 is bounded from above by

$$\gamma \cdot k^{2kd/a} \cdot nkd \cdot \left(\frac{d^{3/2} n^4 D}{\sigma\varepsilon}\right)^2$$

for a sufficiently large absolute constant $\gamma$.
Proof. With a probability of at least $1 - 3W^{-1}$, the number of iterations until the potential
drops by at least

$$\frac{\varepsilon^2 \cdot \min\{\Delta^2, 1\}}{36dD^2k^{kd/a}}$$

is at most $k^{kd/a} + 1$ due to Lemma 8. We estimate the contribution of the failure event, which
occurs only with probability $3W^{-1}$, to the expected running time by 3 and ignore it in the
following. Let $T$ denote the random variable that equals the number of sequences of length
$k^{kd/a} + 1$ until the potential has dropped by one.
The random variable $T$ can only exceed $t$ if

$$\min\{\Delta^2, 1\} \le \frac{36dD^2k^{kd/a}}{\varepsilon^2 t},$$

leading to the following bound on the expected value of $T$:

$$E[T] = \sum_{t=1}^{W} \Pr[T \ge t] \le \int_0^W \Pr\left[\min\{\Delta^2, 1\} \le \frac{36dD^2k^{kd/a}}{\varepsilon^2 t}\right] dt \le t' + \int_{t'}^{W} \Pr\left[\Delta \le \frac{6\sqrt{d}Dk^{kd/(2a)}}{\varepsilon\sqrt{t}}\right] dt,$$

for

$$t' = \left(\frac{(24d + 96)\, n^4 \sqrt{d}\, D\, k^{kd/(2a)}}{\sigma\varepsilon}\right)^2.$$

Let us consider a situation reached by k-means in which there are two clusters $C_1$ and $C_2$
whose centers are at a distance of $\delta$ from each other. We denote the positions of these centers
by $c_1$ and $c_2$. Let $H$ be the bisector between $c_1$ and $c_2$. The points $c_1$ and $c_2$ are the centers
of mass of the points assigned to $C_1$ and $C_2$, respectively. From this, we can conclude the
following: for every point that is assigned to $C_1$ or $C_2$ and that has a distance of at least $\delta$
from the bisector $H$, as compensation another point must be assigned to $C_1$ or $C_2$ that has a
distance of at most $\delta/2$ from $H$. Hence, the total number of points assigned to $C_1$ or $C_2$ can be
at most twice as large as the total number of points assigned to $C_1$ or $C_2$ that are at a distance
of at most $\delta$ from $H$. Hence, there can only exist two centers at a distance of at most $\delta$ if one
of the following two properties is met:

1. There exists a hyperplane from which more than $2d$ points have a distance of at most $\delta$.

2. There exist two subsets of points whose union has cardinality at most $4d$ and whose
centers of mass are at a distance of at most $\delta$.

The probability that one of these events occurs can be bounded as follows using a union bound
and Lemma 13 (see also Arthur and Vassilvitskii [3, Proposition 5.6]):

$$\Pr[\Delta \le \delta] \le n^{2d}\left(\frac{4d\delta}{\sigma}\right)^d + (2n)^{4d}\left(\frac{\delta}{\sigma}\right)^d \le \left(\frac{(4d + 16)\, n^4 \delta}{\sigma}\right)^d.$$

Hence,

$$\Pr\left[\Delta \le \frac{6\sqrt{d}Dk^{kd/(2a)}}{\varepsilon\sqrt{t}}\right] \le \left(\frac{\sqrt{t'}}{\sqrt{t}}\right)^d$$
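and, for $d \ge 3$, we obtain

$$E[T] \le t' + \int_{t'}^{\infty} \left(\frac{\sqrt{t'}}{\sqrt{t}}\right)^d dt = t' + t'^{d/2} \cdot \left[\frac{1}{(-d/2 + 1) \cdot t^{d/2 - 1}}\right]_{t'}^{\infty} = \left(1 + \frac{2}{d - 2}\right) \cdot t' \le 2n\kappa kd \cdot t'.$$

For $d = 2$, we obtain

$$E[T] \le t' + \int_{t'}^{W} \left(\frac{\sqrt{t'}}{\sqrt{t}}\right)^d dt \le t' + t' \cdot [\ln(t)]_1^W = t' \cdot (1 + \ln(W)) \le 2n\kappa kd \cdot t'.$$

Altogether, this shows that the expected number of steps until the potential drops by at
least 1 can be bounded from above by

$$3 + \left(k^{kd/a} + 1\right) \cdot 2n\kappa kd \cdot \left(\frac{(24d + 96)\, n^4 \sqrt{d}\, D\, k^{kd/(2a)}}{\sigma\varepsilon}\right)^2,$$

which can, for a sufficiently large absolute constant $\gamma$, be bounded from above by

$$\gamma \cdot k^{2kd/a} \cdot nkd \cdot \left(\frac{d^{3/2} n^4 D}{\sigma\varepsilon}\right)^2.$$

□


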



4 Iterations with at most $\sqrt{k}$ Active Clusters

In this and the following section, we aim at proving the main lemmas that lead to Theorem 1,
which we will prove in Section 7.2. To do this, we distinguish two cases: In this section, we
deal with the case that at most $\sqrt{k}$ clusters are active. In this case, either few points change clusters,
which yields a potential drop caused by the movement of the centers, or many points change
clusters. Then, in particular, many points switch between two clusters, and not all of them
can be close to the hyperplane bisecting the corresponding centers, which yields the potential
drop in this case.
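
The case distinction depends on how many clusters are active in an iteration, that is, how many clusters gain or lose at least one point. The following is a minimal sketch in plain Python of one k-means iteration together with a helper that reports the active clusters. The function names and the convention that an empty cluster keeps its old center are assumptions made for this illustration, not definitions from the paper.

```python
def nearest(point, centers):
    """Index of the center closest to point (ties go to the smallest index)."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, center)) for center in centers]
    return dists.index(min(dists))

def kmeans_step(points, centers):
    """One iteration: assign every point to its nearest center, then recenter."""
    labels = [nearest(p, centers) for p in points]
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if members:
            # center of mass of the points assigned to cluster j
            new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
        else:
            new_centers.append(tuple(old))  # assumption: empty cluster keeps its center
    return labels, new_centers

def active_clusters(old_labels, new_labels):
    """Clusters that gain or lose at least one point between two assignments."""
    active = set()
    for a, b in zip(old_labels, new_labels):
        if a != b:
            active.update((a, b))
    return active

points = [(0.0,), (1.0,), (10.0,), (11.0,)]
labels1, centers1 = kmeans_step(points, [(0.0,), (1.0,)])
labels2, centers2 = kmeans_step(points, centers1)
print(active_clusters(labels1, labels2))
```

Comparing `len(active_clusters(...))` with $\sqrt{k}$ decides which of the two cases an iteration falls into.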

We define an epoch to be a sequence of consecutive iterations in which no cluster center
assumes more than two different positions. Equivalently, there are at most two different sets
$C_i', C_i''$ that every cluster $C_i$ assumes. The obvious upper bound for the length of an epoch is
$2^k$, which is stated also by Arthur and Vassilvitskii [3]: After that many iterations, at least
one cluster must have assumed a third position. For our analysis, however, $2^k$ is too big, and
we bring it down to a constant.

Lemma 10. The length of any epoch is less than four. 

Proof. Let $x$ be any data point that changes from one cluster to another during an epoch, and
let $i_1, i_2, \ldots, i_\ell$ be the indices of the different clusters to which $x$ belongs in that order. (We
have $i_j \ne i_{j+1}$, but $x$ can change back to a cluster it has already visited. So, e.g., $i_j = i_{j+2}$ is
allowed.) For every $i_j$, we then have two different sets $C_{i_j}'$ and $C_{i_j}''$ with centers $c_{i_j}'$ and $c_{i_j}''$ such
that $x \in C_{i_j}'' \setminus C_{i_j}'$. Since $x$ always belongs to exactly one cluster, we have $C_{i_j} = C_{i_j}'$ for all $j$
except for one $j$ for which $C_{i_j} = C_{i_j}''$. Now assume that $\ell \ge 4$. Then, when changing from $C_{i_1}$
to $C_{i_2}$, we have $\|x - c_{i_2}'\| < \|x - c_{i_4}'\|$ since $x$ prefers $C_{i_2}$ over $C_{i_4}$ and, when changing to $C_{i_4}$, we
have $\|x - c_{i_4}'\| < \|x - c_{i_2}'\|$. This contradicts the assumption that $\ell \ge 4$.

Now assume that $x$ does not change from $C_{i_j}$ to $C_{i_{j+1}}$ for a couple of steps, i.e., $x$ waits until
it eventually changes clusters. Then the reason for eventually changing to $C_{i_{j+1}}$ can only be
that either $C_{i_j}$ has changed to some $\tilde{C}_{i_j}$, which makes $x$ prefer $C_{i_{j+1}}$. But, since $\tilde{C}_{i_j} \ne C_{i_j}''$ and
$x \in \tilde{C}_{i_j}$, we have a third cluster for $C_{i_j}$. Or $C_{i_{j+1}}$ has changed to $\tilde{C}_{i_{j+1}}$, and $x$ prefers $\tilde{C}_{i_{j+1}}$. But
then $\tilde{C}_{i_{j+1}} \ne C_{i_{j+1}}'$ and $x \notin \tilde{C}_{i_{j+1}}$, and we have a third cluster for $C_{i_{j+1}}$.

We can conclude that $x$ visits at most three different clusters, and changes its cluster in
every iteration of the epoch. Furthermore, the order in which $x$ visits its clusters is periodic
with a period length of at most three. Finally, even a period length of three is impossible:
Suppose $x$ visits $C_{i_1}$, $C_{i_2}$, and $C_{i_3}$. Then, to go from $C_{i_j}$ to $C_{i_{j+1}}$ (arithmetic is modulo 3), we
have $\|x - c_{i_{j+1}}'\| < \|x - c_{i_{j+2}}'\|$. Since this holds for $j = 1, 2, 3$, we have a contradiction.

This holds for every data point. Thus, after at most four iterations either k-means terminates,
which is fine, or some cluster assumes a third configuration, which ends the epoch, or
some clustering repeats, which is impossible. □

Similar to Arthur and Vassilvitskii [3], we define a key-value to be an expression of the
form $K = \frac{s}{t} \cdot \mathrm{cm}(S)$, where $s, t \in \mathbb{N}$, $s \le n^2$, $t \le n$, and $S \subseteq X$ is a set of at most $4d\sqrt{k}$
points. (Arthur and Vassilvitskii allow up to $4dk$ points.) For two key-values $K_1, K_2$, we write
$K_1 \equiv K_2$ if and only if they have identical coefficients for every data point.

We say that $X$ is $\delta$-sparse if, for all key-values $K_1, K_2, K_3, K_4$ with $\|K_1 + K_2 - K_3 - K_4\| \le \delta$,
we have $K_1 + K_2 \equiv K_3 + K_4$.

Lemma 11. The probability that the point set $X$ is not $\delta$-sparse is at most

$$n^{16d\sqrt{k} + 12} \cdot \left(\frac{n^4\delta}{\sigma}\right)^d.$$

Proof. Let us first bound the number of possible key-values: There are at most $n^3$ possibilities
for choosing $s$ and $t$ and $n^{4d\sqrt{k}}$ possibilities for choosing the set $S$. Thus, there are at
most $n^{16d\sqrt{k} + 12}$ possibilities for choosing four key-values $K_1, \ldots, K_4$. We fix $K_1, \ldots, K_4$ arbitrarily.
The rest follows from a union bound and the proof of Proposition 5.3 of Arthur and
Vassilvitskii [3]. □

After four iterations, one cluster has assumed a third center or k-means terminates. This
yields the following lemma (see also Arthur and Vassilvitskii [3, Corollary 5.2]).

Lemma 12. Assume that $X$ is $\delta$-sparse. Then, in every sequence of four consecutive iterations
that do not lead to termination and such that in each of these iterations

• at most $\sqrt{k}$ clusters are active and

• each cluster gains or loses at most $2d\sqrt{k}$ points,

the potential decreases by at least $\delta^2/(4n^4)$.

We say that $X$ is $\varepsilon$-separated if, for every hyperplane $H$, there are at most $2d$ points in $X$ that
are within distance $\varepsilon$ of $H$. The following lemma, due to Arthur and Vassilvitskii [3, Proposition
5.6], shows that $X$ is likely to be $\varepsilon$-separated.

Lemma 13 (Arthur and Vassilvitskii [3]). The point set $X$ is not $\varepsilon$-separated with a probability
of at most

$$n^{2d} \cdot \left(\frac{4d\varepsilon}{\sigma}\right)^d.$$

Given that $X$ is $\varepsilon$-separated, every iteration with at most $\sqrt{k}$ active clusters in which one
cluster gains or loses at least $2d\sqrt{k}$ points yields a significant decrease of the potential.

Lemma 14. Assume that $X$ is $\varepsilon$-separated. For every iteration with at most $\sqrt{k}$ active clusters,
the following holds: If a cluster gains or loses more than $2d\sqrt{k}$ points, then the potential drops
by at least $2\varepsilon^2/n$.

This lemma is similar to Proposition 5.4 of Arthur and Vassilvitskii [3]. We present here a 
corrected proof based on private communication with Vassilvitskii. 

Proof. If a cluster $C_i$ gains or loses more than $2d\sqrt{k}$ points in a single iteration with at most
$\sqrt{k}$ active clusters, then there exists another cluster $C_j$ with which $C_i$ exchanges at least $2d + 1$
points. Since $X$ is $\varepsilon$-separated, one of these points, say, $x$, must be at a distance of at least $\varepsilon$
from the hyperplane bisecting the cluster centers $c_i$ and $c_j$. Assume that $x$ switches from $C_i$
to $C_j$.

Then the potential decreases by at least $\|c_i - x\|^2 - \|c_j - x\|^2 = (2x - c_i - c_j) \cdot (c_j - c_i)$. Let
$v$ be the unit vector in $c_j - c_i$ direction. Then $(2x - c_i - c_j) \cdot v \ge 2\varepsilon$. We have $c_j - c_i = \alpha v$ for
$\alpha = \|c_j - c_i\|$, and hence, it remains to bound $\|c_j - c_i\|$ from below. If we can prove $\alpha \ge \varepsilon/n$,
then we have a potential drop of at least $(2x - c_i - c_j) \cdot \alpha v \ge \alpha \cdot 2\varepsilon \ge 2\varepsilon^2/n$ as claimed.

Let $H$ be the hyperplane bisecting the centers of $C_i$ and $C_j$ in the previous iteration. While $H$
does not necessarily bisect $c_i$ and $c_j$, it divides the data points belonging to $C_i$ and $C_j$ correctly.
In particular, this implies that $\|c_i - c_j\| \ge \mathrm{dist}(c_i, H) + \mathrm{dist}(c_j, H)$.

Consider the at least $2d + 1$ data points switching between $C_i$ and $C_j$. One of them must
be at a distance of at least $\varepsilon$ of $H$ since $X$ is $\varepsilon$-separated. Let us assume w.l.o.g. that this
point switches to $C_i$. This yields $\mathrm{dist}(c_i, H) \ge \varepsilon/n$ since $C_i$ contains at most $n$ points. Thus,
$\|c_i - c_j\| \ge \varepsilon/n$, which yields $\alpha \ge \varepsilon/n$ as desired. □

Now set $\delta_i = n^{-16 - (16 + i)\sqrt{k}} \cdot \sigma$ and $\varepsilon_i = \sigma \cdot n^{-4 - i\sqrt{k}}$. Then the probability that the instance
is not $\delta_i$-sparse is bounded from above by

$$n^{16d\sqrt{k} + 12 + 4d - 16d - (16 + i)d\sqrt{k}} \le n^{-id\sqrt{k}}.$$

The probability that the instance is not $\varepsilon_i$-separated is bounded from above by (we use $d \le n$
and $4 \le n$)

$$n^{4d - 4d - id\sqrt{k}} = n^{-id\sqrt{k}}.$$

We abbreviate the fact that an instance is $\delta_i$-sparse and $\varepsilon_i$-separated by $i$-nice. Now Lemmas 12
and 14 immediately yield the following lemma.

Lemma 15. Assume that $X$ is $i$-nice. Then the number of sequences of at most four consecutive
iterations, each of which with at most $\sqrt{k}$ active clusters, until the potential has dropped
by at least 1 is bounded from above by

$$\left(\min\left\{\frac{\sigma^2}{4} \cdot n^{-36 - (32 + 2i)\sqrt{k}},\ 2\sigma^2 \cdot n^{-9 - 2i\sqrt{k}}\right\}\right)^{-1} \le \frac{n^{(c + 2i)\sqrt{k}}}{\sigma^2} =: S_i$$

for a suitable constant $c$.
The first term comes from $\delta_i$, which yields a potential drop of at least $\delta_i^2/(4n^4)$. The second
term comes from $\varepsilon_i$, which yields a drop of at least $2\varepsilon_i^2/n$.

Putting the pieces together yields the main lemma of this section. 

Lemma 16. The expected number of sequences of at most four consecutive iterations, each of
which with at most $\sqrt{k}$ active clusters, until the potential has dropped by at least 1 is bounded
from above by

$$\mathrm{poly}\left(n^{\sqrt{k}}, \frac{1}{\sigma}\right).$$

Proof. The probability that it takes more than $S_i$ such sequences is bounded from above by
the probability that the instance is not $i$-nice, which is bounded from above by $2n^{-id\sqrt{k}}$. Let
$T$ be the random variable of the number of sequences of at most four consecutive iterations,
each with at most $\sqrt{k}$ active clusters, that it takes until we have a potential drop of at least 1.

We observe that k-means always runs for at most $W \le n^{\kappa kd}$ iterations. This yields that we
have to consider $i$ only up to $\kappa\sqrt{k}$. We assume further that $d \ge 2$. Putting all observations
together yields the lemma:

$$E[T] \le S_0 + \sum_{i=0}^{\kappa\sqrt{k}} \Pr[\text{not } i\text{-nice, but } (i+1)\text{-nice}] \cdot S_{i+1} + W \cdot \Pr[\text{not } \kappa\sqrt{k}\text{-nice}]$$
$$\le S_0 + \sum_{i=0}^{\kappa\sqrt{k}} \Pr[\text{not } i\text{-nice}] \cdot S_{i+1} + n^{\kappa kd} \cdot 2n^{-\kappa dk}$$
$$\le \frac{n^{c\sqrt{k}}}{\sigma^2} + \sum_{i=0}^{\kappa\sqrt{k}} 2n^{-id\sqrt{k}} \cdot \frac{n^{(c + 2 + 2i)\sqrt{k}}}{\sigma^2} + 2$$
$$\le \frac{n^{c\sqrt{k}}}{\sigma^2} + \sum_{i=0}^{\kappa\sqrt{k}} \frac{2n^{(c + 2)\sqrt{k}}}{\sigma^2} + 2 \le \mathrm{poly}\left(n^{\sqrt{k}}, \frac{1}{\sigma}\right).$$

□




5 Iterations with at least $\sqrt{k}$ Active Clusters

In this section, we consider steps of the k-means algorithm in which at least $\sqrt{k}$ different
clusters gain or lose points. The improvement yielded by such a step can only be small if
none of the cluster centers changes its position significantly due to the reassignment of points,
which, intuitively, becomes increasingly unlikely the more clusters are active. We show that,
indeed, if at least $\sqrt{k}$ clusters are active, then with high probability one of them changes its
position by $n^{-O(\sqrt{k})}$, yielding a potential drop in the same order of magnitude.

The following observation, which has also been used by Arthur and Vassilvitskii [3], relates 
the movement of a cluster center to the potential drop. 

Lemma 17. If in an iteration of the k-means algorithm a cluster center changes its position
from $c$ to $c'$, then the potential drops by at least $\|c - c'\|^2$.

Proof. The potential is defined as

$$\sum_{x \in X} \|x - c_x\|^2,$$

where $c_x$ denotes the center that is closest to $x$. We can rewrite this as

$$\sum_{x \in X} \|x - c_x\|^2 = \sum_{c} \left( \sum_{x \in X_c} \|x - \mathrm{cm}(X_c)\|^2 + |X_c| \cdot \|\mathrm{cm}(X_c) - c\|^2 \right),$$

where $X_c \subseteq X$ denotes all points from $X$ whose closest center is $c$ and where $\mathrm{cm}(X_c)$ denotes
the center of mass of $X_c$.

Let us consider the case that one center changes its position from $c$ to $c'$. Then $c'$ must be
the center of mass of $X_c$. Furthermore, $|X_c| \ge 1$. Hence, the potential drops by at least

$$\|\mathrm{cm}(X_c) - c\|^2 - \|\mathrm{cm}(X_c) - c'\|^2 = \|c' - c\|^2 - \|c' - c'\|^2 = \|c - c'\|^2.$$ □
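
The decomposition of the potential used in this proof can be verified on a small example. The sketch below, in plain Python with illustrative function names that are not taken from the paper, computes the potential of a single cluster with respect to an arbitrary center $c$ and compares it with the sum of the potential around the center of mass and the term $|X_c| \cdot \|\mathrm{cm}(X_c) - c\|^2$.

```python
def center_of_mass(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))

def sq_dist(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def potential(points, c):
    """Sum of squared distances of the points to the center c."""
    return sum(sq_dist(p, c) for p in points)

def decomposed_potential(points, c):
    """Potential around the center of mass plus |X_c| * ||cm(X_c) - c||^2."""
    cm = center_of_mass(points)
    return potential(points, cm) + len(points) * sq_dist(cm, c)

cluster = [(0.0, 0.0), (2.0, 0.0), (4.0, 0.0)]
c = (1.0, 1.0)
cm = center_of_mass(cluster)
drop = potential(cluster, c) - potential(cluster, cm)
print(potential(cluster, c), decomposed_potential(cluster, c), drop, sq_dist(c, cm))
```

Moving the center from $c$ to the center of mass lowers the potential by $|X_c| \cdot \|c - c'\|^2$, which is at least $\|c - c'\|^2$ because $|X_c| \ge 1$.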
Now we are ready to prove the main lemma of this section. 

Lemma 18. The expected number of steps with at least $\sqrt{k}$ active clusters until the potential
drops by at least 1 is bounded from above by

$$\mathrm{poly}\left(n^{\sqrt{k}}, \frac{1}{\sigma}\right).$$


Proof. We consider one step of the k-means algorithm with at least $\sqrt{k}$ active clusters. Let $\varepsilon$ be
defined as in Lemma 5 for $a = 1$. We distinguish two cases: Either one point that is reassigned
during the considered iteration has a distance of at least $\varepsilon$ from the bisector that it crosses,
or all points are at a distance of at most $\varepsilon$ from their respective bisectors. In the former case,
we immediately get a potential drop of at least $2\varepsilon\Delta$, where $\Delta$ denotes the minimal distance of
two cluster centers. In the latter case, Lemma 5 implies that with high probability less than
$kd$ points are reassigned during the considered step. We apply a union bound over the choices
for these points. In the union bound, we fix not only these points but also the clusters they
are assigned to before and after the step. We denote by $A_i$ the set of points that are assigned
to cluster $C_i$ in both configurations and we denote by $B_i$ and $B_i'$ the sets of points assigned to
cluster $C_i$ before and after the step, respectively, except for the points in $A_i$. Analogously to
Lemma 5, we assume that the positions of the points in $A_1 \cup \ldots \cup A_k$ are fixed adversarially,
and we apply a union bound on the different partitions $A_1, \ldots, A_k$ realizable. Altogether, we
have a union bound over less than

$$n^{(\kappa + 3) \cdot kd}$$

events. Let $c_i$ be the position of the cluster center of $C_i$ before the reassignment, and let $c_i'$ be
the position after the reassignment. Then

$$c_i = \frac{|A_i| \cdot \mathrm{cm}(A_i) + |B_i| \cdot \mathrm{cm}(B_i)}{|A_i| + |B_i|},$$

where $\mathrm{cm}(\cdot)$ denotes the center of mass of a point set. Since $c_i'$ can be expressed analogously,
we can write the change of position of the cluster center of $C_i$ as

$$c_i - c_i' = |A_i| \cdot \mathrm{cm}(A_i) \cdot \left(\frac{1}{|A_i| + |B_i|} - \frac{1}{|A_i| + |B_i'|}\right) + \frac{|B_i| \cdot \mathrm{cm}(B_i)}{|A_i| + |B_i|} - \frac{|B_i'| \cdot \mathrm{cm}(B_i')}{|A_i| + |B_i'|}.$$

Due to the union bound, $\mathrm{cm}(A_i)$ and $|A_i|$ are fixed. Additionally, also the sets $B_i$ and $B_i'$ are
fixed but not the positions of the points in these two sets. If we considered only a single center,
then we could easily estimate the probability that $\|c_i - c_i'\| \le \beta$. For this, we additionally fix
all positions of the points in $B_i \cup B_i'$ except for one of them, say $b_i$. Given this, we can express
the event $\|c_i - c_i'\| \le \beta$ as the event that $b_i$ assumes a position in a ball whose position depends
on the fixed values and whose radius, which depends on the number of points in $A_i$, $B_i$, and
$B_i'$, is not larger than $n\beta$. Hence, the probability is bounded from above by

$$\left(\frac{n\beta}{\sigma}\right)^d.$$

However, we are interested in the probability that this is true for all centers simultaneously.
Unfortunately, the events are not independent for different clusters. We estimate this probability
by identifying a set of $\ell/2$ clusters whose randomness is independent enough, where
$\ell \ge \sqrt{k}$ is the number of active clusters. To be more precise, we do the following: Consider
a graph whose nodes are the active clusters and that contains an edge between two nodes if
and only if the corresponding clusters exchange at least one point. We identify a dominating
set in this graph, i.e., a subset of nodes that covers the graph in the sense that every node
not belonging to this subset has at least one edge into the subset. We can assume that the
dominating set, which we identify, contains at most half of the active clusters. (In order to find
such a dominating set, start with the graph and throw out edges until the remaining graph is
a tree. Then put the nodes on odd layers to the left side and the nodes on even layers to the
right side, and take the smaller side as the dominating set.)
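
The construction in parentheses can be sketched as follows. This is an illustrative plain-Python version, not code from the paper: it uses a breadth-first search to obtain a spanning tree of every connected component and then takes the smaller of the two layer classes. It assumes an undirected graph given as an adjacency dictionary without isolated nodes, which matches the setting here because every active cluster exchanges a point with at least one other cluster.

```python
from collections import deque

def small_dominating_set(adj):
    """Dominating set with at most half of the nodes.

    adj maps every node to an iterable of its neighbours (undirected graph,
    no isolated nodes). In each component, a BFS tree splits the nodes into
    even and odd layers; both classes dominate the component, and the smaller
    one has at most half of its nodes.
    """
    dominating, seen = set(), set()
    for root in adj:
        if root in seen:
            continue
        sides = ([root], [])              # nodes on even layers, nodes on odd layers
        seen.add(root)
        queue = deque([(root, 0)])
        while queue:
            node, depth = queue.popleft()
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    sides[(depth + 1) % 2].append(nb)
                    queue.append((nb, depth + 1))
        dominating.update(min(sides, key=len))
    return dominating

def is_dominating(adj, subset):
    """True if every node outside subset has a neighbour inside subset."""
    return all(v in subset or any(nb in subset for nb in adj[v]) for v in adj)

path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
dom = small_dominating_set(path)
print(dom, is_dominating(path, dom))
```

Every node outside the chosen side has its tree parent or a tree child on the chosen side, which is why the smaller side suffices.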

For every active center $c_i$ that is not in the dominating set, we do the following: We assume
that all the positions of the points in $B_i \cup B_i'$ are already fixed except for one of them. Given
this, we can use the aforementioned estimate for the probability of $\|c_i - c_i'\| \le \beta$. If we iterate
this over all clusters not in the dominating set, we can always use the same estimate; the reason
is that the choice of the subset guarantees that, for every node not in the subset, we have a
point whose position is not fixed yet. This yields an upper bound of

$$\left(\frac{n\beta}{\sigma}\right)^{d\ell/2}.$$

Combining this probability with the number of choices in the union bound yields a bound of

$$n^{(\kappa + 3) \cdot kd} \cdot \left(\frac{n\beta}{\sigma}\right)^{d\ell/2} \le n^{(\kappa + 3) \cdot kd} \cdot \left(\frac{n\beta}{\sigma}\right)^{d\sqrt{k}/2}.$$

For

$$\beta = \frac{\sigma}{n^{(4\kappa + 6)\sqrt{k} + 1}},$$

the probability can be bounded from above by $n^{-\kappa kd} \le W^{-1}$.

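Now we also take into account the failure probability of $2W^{-1}$ from Lemma 5. This yields
that, with a probability of at least $1 - 3W^{-1}$, the potential drops in every iteration in which
at least $\sqrt{k}$ clusters are active by at least

$$\Gamma := \min\{2\varepsilon\Delta, \beta^2\} \ge \min\left\{\Delta \cdot \mathrm{poly}\left(n^{-1}, \sigma\right),\ \mathrm{poly}\left(n^{-\sqrt{k}}, \sigma\right)\right\}$$

since $d \le n$ and $D$ is polynomially bounded in $\sigma$ and $n$. The number $T$ of steps with at least
$\sqrt{k}$ active clusters until the potential has dropped by one can only exceed $t$ if $\Gamma \le 1/t$. Hence,

$$E[T] \le 3 + \sum_{t=1}^{W} \Pr[T \ge t] \le 3 + \int_{t=0}^{W} \Pr[T \ge t]\, dt \le 4 + \int_{t=1}^{W} \Pr\left[\Gamma \le \frac{1}{t}\right] dt$$
$$\le 4 + \beta^{-2} + \int_{t=\beta^{-2}}^{\infty} \Pr\left[\Delta \cdot \mathrm{poly}\left(n^{-1}, \sigma\right) \le \frac{1}{t}\right] dt$$
$$\le 4 + \beta^{-2} + \int_{t=\beta^{-2}}^{\infty} \Pr\left[\Delta \le \frac{1}{t} \cdot \mathrm{poly}\left(n, \frac{1}{\sigma}\right)\right] dt$$
$$\le 4 + \beta^{-2} + \int_{t=\beta^{-2}}^{\infty} \min\left\{1, \left(\frac{(4d + 16) \cdot n^4 \cdot \mathrm{poly}(n, \sigma^{-1})}{t \cdot \sigma}\right)^d\right\} dt = \mathrm{poly}\left(n^{\sqrt{k}}, \frac{1}{\sigma}\right),$$

where the integral is upper bounded as in the proof of Lemma 9. □


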



6 A Polynomial Bound in One Dimension 

In this section, we consider a one-dimensional set $X \subseteq \mathbb{R}$ of points. The aim of this section
is to prove that the expected number of steps until the potential has dropped by at least 1 is
bounded by a polynomial in $n$ and $1/\sigma$.

We say that the point set $X$ is $\varepsilon$-spreaded if the following conditions are fulfilled:

• There is no interval of length $\varepsilon$ that contains three or more points of $X$.

• For any four points $x_1, x_2, x_3, x_4$, where $x_2$ and $x_3$ may denote the same point, we have
$|x_1 - x_2| \ge \varepsilon$ or $|x_3 - x_4| \ge \varepsilon$.
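
The two conditions can be tested directly on a finite point set. The following plain-Python sketch is an illustration under stated assumptions, not part of the paper: intervals are taken to be closed, and the second condition is read as "at most one pair of distinct points is closer than $\varepsilon$", which is equivalent because any two different pairs are either disjoint or share exactly one point (the case $x_2 = x_3$). The function name `is_eps_spreaded` is hypothetical.

```python
from itertools import combinations

def is_eps_spreaded(xs, eps):
    """Check the two conditions of eps-spreadedness for one-dimensional points."""
    xs = sorted(xs)
    # Condition 1: no closed interval of length eps contains three or more points.
    for i in range(len(xs) - 2):
        if xs[i + 2] - xs[i] <= eps:
            return False
    # Condition 2: among any two different pairs of points, at least one pair
    # is at distance >= eps, i.e., at most one pair is closer than eps.
    close_pairs = sum(1 for a, b in combinations(xs, 2) if abs(a - b) < eps)
    return close_pairs <= 1

print(is_eps_spreaded([0.0, 1.0, 2.0, 3.0], 0.5))
print(is_eps_spreaded([0.0, 0.1, 5.0, 5.1], 0.5))
```

The quadratic pair enumeration is chosen for clarity; on sorted input, only neighbouring points would need to be inspected.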

The following lemma justifies the notion of $\varepsilon$-spreadedness.

Lemma 19. Assume that $X$ is $\varepsilon$-spreaded. Then the potential drops by at least $\varepsilon^2/(4n^2)$ in every
iteration.

Proof. Let $C_i$ be the left-most active cluster, and let $C_j$ be the right-most active cluster.

We consider $C_i$ first. $C_i$ exchanges only points with the clusters to its right, for otherwise it
would not be the leftmost active cluster. Thus, it cannot gain and lose points simultaneously.
Assume that it gains points. Let $A_i$ be the set of points of $C_i$ before the iteration, and let
$B_i$ be the set of points that it gains. Obviously, $\min B_i > \max A_i$. If $B_i \cup A_i$ contains at
least three points, then we are done: If $|A_i| \ge 2$, then we consider the two rightmost points
$x_1 < x_2$ of $A_i$ and the leftmost point $x_3$ of $B_i$. These points are not within a common interval
of size $\varepsilon$. Hence, $x_3$ has a distance of at least $\varepsilon/2$ from the center of mass $\mathrm{cm}(A_i)$ because
$\mathrm{dist}(x_1, x_3) \ge \varepsilon$, $x_1 < x_2 < x_3$, and $\mathrm{cm}(A_i) \le (x_1 + x_2)/2$. Hence,

$$\mathrm{cm}(B_i) \ge \mathrm{cm}(A_i) + \frac{\varepsilon}{2}.$$

Thus, the cluster center moves to the right from $\mathrm{cm}(A_i)$ to

$$\mathrm{cm}(A_i \cup B_i) = \frac{|A_i| \cdot \mathrm{cm}(A_i) + |B_i| \cdot \mathrm{cm}(B_i)}{|A_i \cup B_i|} \ge \frac{|A_i \cup B_i| \cdot \mathrm{cm}(A_i) + |B_i| \cdot \frac{\varepsilon}{2}}{|A_i \cup B_i|} \ge \mathrm{cm}(A_i) + \frac{\varepsilon}{2n}.$$

The case $|A_i| = 1$ and $|B_i| \ge 2$ is analogous. The same holds if cluster $C_j$ switches from $A_j$ to
$A_j \cup B_j$ with $|A_j \cup B_j| \ge 3$, or if $C_i$ or $C_j$ lose points but initially have at least three points.
Thus, in these cases, a cluster center moves by at least $\varepsilon/(2n)$, which causes a potential drop by at
least $\varepsilon^2/(4n^2)$.

It remains to consider the case that $|A_i \cup B_i| = 2 = |A_j \cup B_j|$. Thus, $A_i = \{a_i\}$, $B_i = \{b_i\}$,
and also $A_j = \{a_j\}$, $B_j = \{b_j\}$. We restrict ourselves to the case that $C_i$ consists only of $a_i$
and gains $b_i$ and that $C_j$ has $a_j$ and $b_j$ and loses $b_j$. If only two clusters are active, we have
$b_i = b_j$, and we have only three different points. Otherwise, all four points are distinct. This
allows us to bring $\varepsilon$-spreadedness into play. We have either $|a_i - b_i| \ge \varepsilon$ or $|a_j - b_j| \ge \varepsilon$. But
then either the center of $C_i$ or the center of $C_j$ moves by at least $\varepsilon/2$, which implies that the
potential decreases by at least $\varepsilon^2/4 \ge \varepsilon^2/(4n^2)$. □

Assume that $X$ is $\varepsilon$-spreaded. Then the number of iterations until the potential has dropped
by at least 1 is at most $4n^2/\varepsilon^2$ by the lemma above. Let us estimate the probability that $X$ is
$\varepsilon$-spreaded.

Lemma 20. The probability that $X$ is not $\varepsilon$-spreaded is bounded from above by $2n^4\varepsilon^2/\sigma^2$.

Proof. For the first property, let us consider any three points $x_1, x_2, x_3$, and assume that $x_1$
is fixed arbitrarily. Then, in order to share an interval of size $\varepsilon$, we must have $|x_1 - x_i| \le \varepsilon$
for $i = 2, 3$. Since $x_2$ and $x_3$ are independent, this happens with a probability of at most
$(\varepsilon/\sigma)^2$. There are at most $n^3 \le n^4$ choices for $x_1, x_2, x_3$.

For the second property, consider any $x_1, \ldots, x_4$, and assume that $x_2$ and $x_3$ are fixed. Then
the probability that $|x_1 - x_2| < \varepsilon$ and $|x_3 - x_4| < \varepsilon$ is at most $(\varepsilon/\sigma)^2$. There are at most $n^4$
choices for $x_1, \ldots, x_4$.

Overall, by a union bound, the probability that $X$ is not $\varepsilon$-spreaded is at most $2n^4\varepsilon^2/\sigma^2$. □

Now we have all ingredients for the proof of the main lemma of this section. 

Lemma 21. The number of iterations of k-means until the potential has dropped by at least 1
is bounded by a polynomial in $n$ and $1/\sigma$.

Proof. Let $T$ be the random variable of the number of iterations until the potential has dropped
by at least 1. If $T \ge t$, then $X$ cannot be $\varepsilon$-spreaded with $4n^2/\varepsilon^2 < t$. Thus, in this case,
$X$ is not $\varepsilon$-spreaded with $\varepsilon = 2n/\sqrt{t}$. In the worst case, k-means runs for at most $n^{\kappa k}$ iterations.
Hence, by Lemma 20,

$$E[T] = \sum_{t=1}^{n^{\kappa k}} \Pr[T \ge t] \le \sum_{t=1}^{n^{\kappa k}} \Pr\left[X \text{ is not } \frac{2n}{\sqrt{t}}\text{-spreaded}\right] \le \sum_{t=1}^{n^{\kappa k}} \frac{8n^6}{\sigma^2 t} \le \frac{8n^6}{\sigma^2} \cdot \left(1 + \kappa k \ln n\right),$$

which is polynomial in $n$ and $1/\sigma$ since $k \le n$. □

Finally, we remark that, by choosing $\varepsilon = \sigma/n^{c+2}$, we obtain that the probability that the
number of iterations until the potential has dropped by at least 1 exceeds a polynomial in $n$ and
$1/\sigma$ is bounded from above by $O(n^{-2c})$. This yields a bound on the running-time of k-means
for $d = 1$ that holds with high probability.
7 Putting the Pieces Together

In the previous sections, we have only analyzed the expected number of iterations until the
potential drops by at least 1. To bound the expected number of iterations that k-means needs
to terminate, we need an upper bound on the potential in the beginning. To get this, we use the
following lemma.

Lemma 22. Let $x$ be a one-dimensional Gaussian random variable with standard deviation $\sigma$
and mean $\mu \in [0, 1]$. Then, for all $t \ge 1$,

$$\Pr[x \notin [-t, 1 + t]] \le \sigma \cdot \exp\left(-\frac{t^2}{2\sigma^2}\right).$$

For $D = \sqrt{2\sigma^2 \ln(n^{1 + \kappa kd} d\sigma)} \le \mathrm{poly}(n, \sigma)$, the probability that any component of any of the
$n$ data points is not contained in the hypercube $\mathcal{D} = [-D, 1 + D]^d$ is bounded from above by
$n^{-\kappa kd} \le W^{-1}$. This implies that $X \subseteq \mathcal{D}$ with a probability of at least $1 - W^{-1}$.

In the beginning, we made the assumption that $\sigma \le 1$. While this covers the small values
of $\sigma$, which we consider as more relevant, the assumption is only a technical requirement, and
we can get rid of it: The number of iterations that k-means needs is invariant under scaling
of the point set $X$. Now assume that $\sigma > 1$. Then we consider $X$ scaled down by $1/\sigma$,
which corresponds to the following model: The adversary chooses points from the hypercube
$[0, 1/\sigma]^d \subseteq [0, 1]^d$, and then we add $d$-dimensional Gaussian vectors with standard deviation
1 to every data point. The expected running-time that k-means needs on this instance is
bounded from above by the running-time needed for adversarial points chosen from $[0, 1]^d$ and
$\sigma = 1$, which is $\mathrm{poly}(n) \le \mathrm{poly}(n, 1/\sigma)$.
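
The scaling invariance used in this argument can be observed in a small experiment. The sketch below is plain Python with hypothetical helper names. It counts the iterations of the k-means method on a point set and on a scaled copy, assuming that ties are broken by the smallest index and that an empty cluster keeps its center. The scaling factor 0.5 is chosen only because multiplying by a power of two is exact in floating-point arithmetic, so both runs make identical decisions.

```python
import random

def kmeans_iterations(points, centers, max_iter=10_000):
    """Number of iterations that change the assignment before k-means stabilizes."""
    centers = [tuple(c) for c in centers]
    labels = None
    for iteration in range(max_iter):
        new_labels = []
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            new_labels.append(dists.index(min(dists)))   # ties: smallest index
        if new_labels == labels:
            return iteration                             # assignment is stable
        labels = new_labels
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:                                  # empty clusters keep their center
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return max_iter

def scaled(points, factor):
    """Multiply every coordinate of every point by factor."""
    return [tuple(factor * a for a in p) for p in points]

rng = random.Random(0)
pts = [(rng.random(), rng.random()) for _ in range(40)]
init = pts[:3]
print(kmeans_iterations(pts, init), kmeans_iterations(scaled(pts, 0.5), scaled(init, 0.5)))
```

Both counts coincide because scaling all points and centers by the same factor scales all squared distances by the same positive constant and therefore preserves every nearest-center decision.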

7.1 Proof of Theorem 4

We obtain a bound that is polynomial in $n$ and $1/\sigma$ from Lemmas 21 and 22. First, after one
iteration, the potential is bounded from above by $\mathrm{poly}(n, D) = \mathrm{poly}(n)$. If this is not the case,
we bound the number of iterations by $W$, which adds $W \cdot W^{-1}$ to the expected number of
iterations. Second, the expected number of iterations until the potential has dropped by at
least 1 is bounded by $\mathrm{poly}(n, 1/\sigma)$, which yields a bound of $\mathrm{poly}(n, 1/\sigma)$ until the algorithm
terminates. This proves Theorem 4.

The result that the probability that the number of iterations exceeds a polynomial in $n$ and
$1/\sigma$ is at most $O(1/\mathrm{poly}(n))$ follows immediately.

7.2 Proof of Theorem 1

In the remainder of this section, we restrict ourselves to $d \ge 2$. For $d = 1$, we already have a
polynomial bound according to Theorem 4 and Section 6.
After $\mathrm{poly}(n^{\sqrt{k}}, 1/\sigma)$ iterations, we have

• at least $\mathrm{poly}(n^{\sqrt{k}}, 1/\sigma)$ sequences of four consecutive iterations, each of which with at
most $\sqrt{k}$ active clusters, or

• at least $\mathrm{poly}(n^{\sqrt{k}}, 1/\sigma)$ iterations with at least $\sqrt{k}$ active clusters.

Thus, by Lemmas 16 and 18, the expected number of steps until the potential has dropped by
one is at most $\mathrm{poly}(n^{\sqrt{k}}, 1/\sigma)$.

After the first iteration, the potential is at most $nd \cdot (2D + 1)^2$. As argued above, we can
restrict ourselves to $\sigma \le 1$, which implies $D \le \mathrm{poly}(n)$. This yields that the expected number
of steps until termination is at most

$$nd \cdot (2D + 1)^2 \cdot \mathrm{poly}\left(n^{\sqrt{k}}, \frac{1}{\sigma}\right) = \mathrm{poly}\left(n^{\sqrt{k}}, \frac{1}{\sigma}\right),$$

provided that $X \subseteq \mathcal{D}$. If $X \not\subseteq \mathcal{D}$, we bound the number of iterations by the worst-case bound
of $W$, which contributes only $W^{-1} \cdot W = 1$ to the expected number of iterations. This proves
Theorem 1.

7.3 Proofs of Theorem 2 and Corollary 3

We exploit Lemma 9 with $a = 2$. Then the expected number of iterations until the potential
has dropped by at least 1 is bounded from above by

$$k^{kd} \cdot \mathrm{poly}\left(n, \frac{1}{\sigma}\right).$$

Again, after the first iteration, the potential is at most $nd \cdot (2D + 1)^2 \le \mathrm{poly}(n)$ with a
probability of at least $1 - W^{-1}$. This shows that the expected number of iterations, provided
$X \subseteq \mathcal{D}$, is bounded from above by

$$k^{kd} \cdot \mathrm{poly}\left(n, \frac{1}{\sigma}\right).$$

The event $X \not\subseteq \mathcal{D}$ contributes only 1 to the expected number of iterations as argued in the
previous section. This completes the proof of Theorem 2.

If $k, d \in O\left(\sqrt{\log n / \log\log n}\right)$, then $k^{kd} \le \mathrm{poly}(n)$, which proves Corollary 3.
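
To see why $k^{kd}$ is polynomially bounded in this regime, the following plain-Python sketch evaluates $\ln(k^{kd}) = kd \ln k$ at the threshold $k = d = \sqrt{\ln n / \ln\ln n}$. Using natural logarithms, the constant 1 inside the $O(\cdot)$, and real-valued $k$ and $d$ are simplifying assumptions of this illustration; the function names are hypothetical. In this setting $kd \ln k = \frac{\ln n}{\ln\ln n} \cdot \frac{1}{2}(\ln\ln n - \ln\ln\ln n) \le \frac{1}{2}\ln n$ for $n \ge e^e$, so $k^{kd} \le \sqrt{n}$.

```python
import math

def threshold(n):
    """sqrt(ln n / ln ln n), the growth rate allowed for k and d (constant 1 assumed)."""
    return math.sqrt(math.log(n) / math.log(math.log(n)))

def log_k_to_the_kd(k, d):
    """Natural logarithm of k^(kd)."""
    return k * d * math.log(k)

for n in (10**3, 10**6, 10**12):
    s = threshold(n)
    # compare ln(k^{kd}) with ln(sqrt(n)) at k = d = s
    print(n, round(s, 3), log_k_to_the_kd(s, s) <= 0.5 * math.log(n))
```

A larger constant inside the $O(\cdot)$ changes the exponent of $n$ but keeps $k^{kd}$ polynomial in $n$.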



8 Conclusions 

We have proved two upper bounds for the smoothed running-time of the k-means method:
The first bound is $\mathrm{poly}(n^{\sqrt{k}}, 1/\sigma)$. The second bound is $k^{kd} \cdot \mathrm{poly}(n, 1/\sigma)$, which decouples
the exponential growth in $k$ and $d$ from the number of points and the standard deviation. In
particular, this yields a smoothed running-time that is polynomial in $n$ and $1/\sigma$ for
$k, d \in O\left(\sqrt{\log n / \log\log n}\right)$.

The obvious question now is whether a bound exists that is polynomial in $n$ and $1/\sigma$, without
exponential dependence on $k$ or $d$. We believe that such a bound exists. However, we suspect
that new techniques are required to prove it; bounding the smallest possible improvement from
below might not be sufficient. The reason for this is that the number of possible partitions,
and thus the number of possible k-means steps, grows exponentially in $k$, which makes it more
likely for small improvements to exist as $k$ grows.

Finally, we are curious if our techniques carry over to other heuristics. In particular, Lemma 5
is quite general, as it bounds from above the number of points that are close to the boundaries
of the Voronoi partitions that arise during the execution of k-means. In fact, we believe that
a slightly weaker version of Lemma 5 is also true for arbitrary Voronoi partitions and not only
for those arising during the execution of k-means. This insight might turn out to be helpful in
other contexts as well.